DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage

The extensive information capacity of DNA, coupled with decreasing costs for DNA synthesis and sequencing, makes DNA an attractive alternative to traditional data storage. The processes of writing, storing, and reading DNA exhibit specific error profiles and impose constraints that DNA sequences must adhere to. We present DNA-Aeon, a concatenated coding scheme for DNA data storage. It supports the generation of variable-sized encoded sequences with a user-defined guanine-cytosine (GC) content, a homopolymer length limit, and the avoidance of undesired motifs. It also lets users provide custom codebooks that adhere to additional constraints. DNA-Aeon can correct substitution errors, insertions, deletions, and the loss of whole DNA strands. Comparisons with other codes show better error-correction capabilities of DNA-Aeon at similar redundancy levels, with decreased DNA synthesis costs. In-vitro tests indicate high reliability of DNA-Aeon even in the case of skewed sequencing read distributions and high read dropout.
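The three sequence constraints named above (GC content, homopolymer length, undesired motifs) can be illustrated with a minimal checker. This is a sketch only, not DNA-Aeon's implementation: the function names, the default GC range of 0.4 to 0.6, and the default homopolymer limit of 3 are hypothetical values chosen for illustration.

```python
from typing import Iterable, Tuple


def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    if not seq:
        raise ValueError("sequence must not be empty")
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def satisfies_constraints(
    seq: str,
    gc_range: Tuple[float, float] = (0.4, 0.6),  # hypothetical default
    max_homopolymer: int = 3,                    # hypothetical default
    forbidden_motifs: Iterable[str] = (),
) -> bool:
    """Check GC content, homopolymer length, and motif avoidance."""
    low, high = gc_range
    return (
        low <= gc_content(seq) <= high
        and longest_homopolymer(seq) <= max_homopolymer
        and not any(motif in seq for motif in forbidden_motifs)
    )
```

A check of this kind only verifies a finished sequence; a constrained encoder has to enforce the same rules while the sequence is being generated.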


Field-specific reporting

Life sciences

SRR19954696, SRR19954697, and SRR19954694, BioProject accession PRJNA855029. The error-correction and rate analysis data generated in this study are provided in the Supplementary Information and Source Data files. The raw data of the sequence processing parameter evaluations are provided in the Source Data files. The MESA configuration file used for the cost analysis under realistic conditions is provided in the Source Data files. The final parameters used for the rate analysis are available in the Source Data files. The encoded data, split into 10mers as used for the mCGR evaluation, are available in the Source Data files. Source data are provided with this paper. No data restrictions apply.

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
Encoded data was permuted at random positions, starting with 0 errors and adding 500 errors per evaluation step (substitution evaluation) for each code evaluated, until the code could not decode any of the encoded data. The increment of 500 errors per step was chosen to be small enough that, for each code, the number of errors at which decoding failures first occur represents a BER (errors / total encoded bases) increase of less than 1 % over the last BER for which no decoding failure was observed. For the substitution / insertion / deletion evaluation, we took the base error rates described in 10.1073/pnas.2004821117 (high mutagenesis), started with a multiplier of 0.1, and gradually increased the multiplier by 0.5 until the codes could not decode any of the encoded data. For the biological data, we gradually decreased the number of reads for each dataset, first in steps of 10 %; if decoding was still possible with 10 % of the input data, we tested 5 %, 2 %, and 1 %; and finally, if the data could be decoded with 1 % of the input reads, we continued in steps of 0.1 % until decoding failed. This approach allowed us to bin the combinations of processing parameters, encoding parameters, and input files into those that require more than 10 % of the sequencing reads, between 10 % and 1 %, and less than 1 % of the input reads.
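The read-reduction schedule for the biological data can be sketched as follows. This is an illustration under assumptions, not the evaluation scripts used in the study: `decodes` is a hypothetical callback that reports whether decoding succeeds with a given percentage of the reads, and the handling of the bin boundaries (exactly 10 % and exactly 1 %) is an assumption, since the text does not specify it.

```python
from typing import Callable, List, Optional


def read_percentages() -> List[float]:
    """Percentages of the input reads to test, in descending order."""
    coarse = list(range(100, 0, -10))                    # 100, 90, ..., 10
    medium = [5, 2, 1]                                   # tested if 10 % decodes
    fine = [tenths / 10 for tenths in range(9, 0, -1)]   # 0.9, 0.8, ..., 0.1
    return coarse + medium + fine


def smallest_decodable_percentage(
    decodes: Callable[[float], bool],
) -> Optional[float]:
    """Walk the schedule until the first failure.

    Returns the last percentage that still decoded, or None if decoding
    already fails with all reads.
    """
    last_ok = None
    for pct in read_percentages():
        if not decodes(pct):
            break
        last_ok = pct
    return last_ok


def read_requirement_bin(min_pct: Optional[float]) -> str:
    """Assign one of the three bins (boundary handling is an assumption)."""
    if min_pct is None:
        return "not decodable"
    if min_pct > 10:
        return "more than 10 %"
    if min_pct > 1:
        return "between 10 % and 1 %"
    return "1 % or less"
```

For example, a dataset that decodes down to 2 % of its reads but fails at 1 % would be reported as `read_requirement_bin(smallest_decodable_percentage(lambda pct: pct >= 2))`, which falls in the middle bin.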
For the rate analysis, we used the same approach as for the substitution / insertion / deletion evaluation, using the error rates described by Press et al. and gradually increasing the multiplier by 0.5. For a given base error rate, we adjusted the code parameters until decoding succeeded 100 out of 100 times with as few encoded bases as possible. We stopped at a multiplier of 2.5, as this corresponds to the error rate expected after around 150 years of storage in nature (10.1098/rspb.2012.1745) and potentially millions of years under optimal conditions (10.1002/anie.201411378).
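Scaling a set of base error rates by a multiplier can be written in a few lines. The sketch below makes two assumptions that are labeled in the code: the base rates are placeholders, not the values reported by Press et al., and the starting multiplier of the rate analysis, which the text does not state, is set to 0.5 purely for illustration.

```python
from typing import Dict, List


def scale_error_rates(base_rates: Dict[str, float], multiplier: float) -> Dict[str, float]:
    """Multiply every per-base error rate by the same multiplier."""
    return {kind: rate * multiplier for kind, rate in base_rates.items()}


def multiplier_schedule(start: float, step: float, stop: float) -> List[float]:
    """Multipliers from start upwards in increments of step, not exceeding stop."""
    values = []
    k = 0
    while start + k * step <= stop + 1e-9:   # tolerance for float rounding
        values.append(round(start + k * step, 10))
        k += 1
    return values


# Placeholder rates for illustration only; NOT the values from Press et al.
PLACEHOLDER_BASE_RATES = {
    "substitution": 0.01,
    "insertion": 0.005,
    "deletion": 0.005,
}

# The start value 0.5 is an assumption; step 0.5 and stop 2.5 come from the text.
for multiplier in multiplier_schedule(start=0.5, step=0.5, stop=2.5):
    rates = scale_error_rates(PLACEHOLDER_BASE_RATES, multiplier)
    print(multiplier, rates)
```

At each multiplier, the code parameters would then be tuned until all 100 decoding attempts succeed with as few encoded bases as possible; that tuning step depends on the code being evaluated and is not shown here.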

Data exclusions
No data was excluded.

Replication
Each error-correction performance comparison step was repeated 100 times; the number of times decoding succeeded for each step and code is reported in the manuscript. The data was sequenced once for economic reasons. The visualization of the PCR amplification product on a 2 % agarose gel was carried out a single time, to confirm that the PCR product included the target sequences before sequencing. For the evaluation of the impact of sequencing processing parameters on the decodability of files, the evaluation was carried out a single time for each parameter combination, file, and read percentage, with the different files serving as repetitions. All files evaluated showed the same trend: less restrictive preprocessing parameters reduce the amount of raw data required.

Randomization
The errors were distributed randomly for each repetition.

Blinding
Blinding is not relevant to this study, as the evaluation was automated.
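The substitution evaluation described under sample size, together with the 100 repetitions and the per-repetition random error placement, can be sketched as a small harness. This is not the authors' evaluation code. It assumes that the encoded data is treated as one concatenated string, that error positions are distinct, and that each substituted base differs from the original; `decodes` is a hypothetical callback standing in for the decoder of the code under test.

```python
import random
from typing import Callable, Dict

BASES = "ACGT"


def inject_substitutions(seq: str, n_errors: int, rng: random.Random) -> str:
    """Substitute n_errors distinct random positions with a different base."""
    positions = rng.sample(range(len(seq)), n_errors)
    out = list(seq)
    for pos in positions:
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)


def evaluate_substitutions(
    seq: str,
    decodes: Callable[[str, str], bool],  # hypothetical: (corrupted, original) -> success
    step: int = 500,
    repetitions: int = 100,
    seed: int = 0,
) -> Dict[int, int]:
    """Map each error count to the number of successful decodings.

    Errors are redrawn for every repetition. The loop stops at the first
    step for which no repetition decodes.
    """
    rng = random.Random(seed)
    results: Dict[int, int] = {}
    n_errors = 0
    while n_errors <= len(seq):
        successes = sum(
            decodes(inject_substitutions(seq, n_errors, rng), seq)
            for _ in range(repetitions)
        )
        results[n_errors] = successes
        if successes == 0:
            break
        n_errors += step
    return results
```

The BER at each step is the error count divided by `len(seq)`. For a quick trial without a real decoder, a stand-in such as `lambda corrupted, original: sum(a != b for a, b in zip(corrupted, original)) <= 1000` can be passed as `decodes`.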


March 2021